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ABSTRACT 

Motivation: Histone modifications regulate chromatin structure and 
gene expression. Although nucleosome formation is known to be af- 
fected by primary DNA sequence composition, no sequence signature 
has been identified for histone modifications. It is known that dense 
H3K4me3 nucleosome sites are accompanied by a low density of 
other nucleosomes and are associated with gene activation. This ob- 
servation suggests a different sequence composition of H3K4me3 
from other nucleosomes. 

Approach: To understand the relationship between genome se- 
quence and chromatin structure, we studied DNA sequences at his- 
tone modification sites in various human cell types. We found 
sequence specificity for H3K4me3, but not for other histone modifica- 
tions. Using the sequence specificities of H3 and H3K4me3 nucleo- 
somes, we developed a model that computes the probability of 
H3K4me3 occupation at each base pair from the genome sequence 
context. 

Results: A comparison of our predictions with experimental data sug- 
gests a high performance of our method, revealing a strong associ- 
ation between H3K4me3 and specific genomic DNA context. The high 
probability of H3K4me3 occupation occurs at transcription start and 
termination sites, exon boundaries and binding sites of transcription 
regulators involved in chromatin modification activities, including his- 
tone acetylases and enhancer- and insulator-associated factors. Thus, 
the human genome sequence contains signatures for chromatin modi- 
fications essential for gene regulation and development. Our method 
may be applied to find new sequence elements functioning by chro- 
matin modulation. 

Availability: Software and supplementary data are available at 
Bioinformatics online. 

Contact: misook.ha@samsung.com or wli@uchicago.edu 
Supplementary information: Supplementary data are available at 
Bioinformatics online. 
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1 INTRODUCTION 

Chromatin remodeling mediated by histone modifications is an 
important mechanism for specific gene expression (Ha et al., 
2011). However, the relationship between genome sequence 
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and chromatin remodehng is not well understood. Although 
DNA sequence preferences for nucleosome formation have 
been known since the mid 1980s (Drew and Travers, 1985; 
Struhl, 1985; Yuan et al., 2005), the role of histone-DNA inter- 
action in nucleosome formation is subjected to debate (Kaplan et 
al., 2009; Zhang et al., 2009). The interplay between transcription 
factors (TFs) and adenosine triphosphate-dependent chromatin 
modulating factors in regulating histone modifications imphes 
that histone modifications do not follow the sequence prefer- 
ences of the general H3 nucleosomes or the other nucleosomes 
that are not trimethylated at H3K4 (Berger, 2007; Jenuwein and 
AUis, 2001; Matthews et al., 2007). Indeed, the H3K4me3 nu- 
cleosome is preferentially enriched in the genomic regions show- 
ing a low density of other nucleosomes and marks an open 
chromatin region. The advance of DNA sequencing technology 
with immunoprecipitated DNA segment associated with specific 
histone modifications allows us to examine the DNA sequence 
compositions specific to histone modifications in an unbiased 
way. Therefore, we extracted DNA sequences of nucleosomes 
carrying individual histone modifications from ChlP-seq data 
(Barski et al, 2007; Cui et al, 2009; Hawkins et al, 2010; 
Lister et al, 2009; Wang et al, 2008, 2009b) and identified 
the features of sequences bound to histone modifications 
in ChlP-seq experiments. Using the identified sequence specifi- 
cities of H3K4me3 and H3 nucleosomes, we developed a model 
to compute the probabihties of H3K4me3 and H3 nucleosome 
occupation from the genome sequence context. H3K4me3 
ChlP-seq data in various human cells indicate that the loci pre- 
dicted by our model to have a high probability of the H3K4me3 
sequence signature are preferentially occupied by H3K4me3 nu- 
cleosome. Our study provides a method for investigating the 
DNA sequence features of chromatin structure. Furthermore, 
our analyses show that the human genome sequence contains 
signals for chromatin remodeling at epigenetic regulatory 
elements. 

2 METHODS 

2.1 ChlP-seq data sources 

The ChlP-seq data of H3 nucleosomes in two different conditions of 
human CD4 + T cells were obtained from Schones et al. (2008). The 
ChlP-seq data of CTCF, H3K4mel, H3K4me2, H3K4me3, 
H3K27me3, RNA Pol II in human CD4 + T cells, CD133+ and 
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CD36+ cells were obtained from Barski et al. (2007) and Cui et al. 
(2009). Methyl-bisulfite sequencing data and ChlP-seq data of various 
histone methylations in human embryonic stem cells (HESC) are from 
Lister et al. (2009) and Hawkins et al. (2010). The ChlP-seq data of 
histone acetylases, deacetylases and acetylations are from Wang et al. 
(2008, 2009b). 

2.2 Identification of sequence specificities of nucleosome 
modifications 

The 30-50 bp sequences from the ChlP-seq data are mapped to the 
February 2009 human reference sequence (GRCh37/hgl9) by perfect 
and unique match without allowing any mismatch or gap. To recover 
nucleosomal DNA fragments, each read was extended toward its 3' -end 
by 151 bp; the 30-50 bp reads are from the ends of both strands in nu- 
cleosome DNAs. The possible range of the ends of nucleosome-bound 
DNAs may be wider than that of micrococal nuclease-treated DNAs. 
Therefore, we consider 151 bp nucleosome-bound DNA regions, and 
from these regions, we estimate the frequencies of monomer, 2, 3, 4, 5 
and 6mer sequences for each histone modification and each cell type. The 
sequence frequencies are normalized by the sequence composition in the 
reference genome sequence (hgl9/GRCh37) to account for bias because 
of genome sequence composition. In this way, we estimate the enrichment 
of nucleosomes at every sequence composition. 

2.3 A probabilistic model of H3 and H3K4me3 
nucleosome occupation in a genome 

The occurrence of an H3 or H3K4me3 nucleosome at a genomic site can 
be affected by adjacent sequences or adjacent nucleosomes. Across a 
genome, we calculate the probability that a nucleosome (H3 or 
H3K4me3) locates at a locus by using the modified fifth order hidden 
Markov model (HMM) (Rabiner, 1989). This model considers possible 
arrangements in view of the competition with adjacent nucleosomes in 
evaluating the forward and backward status and sequence composition of 
the 147 bp sequence. 

2.3.1 Forward procedure Using the forward procedure, we calculate 
the probability of each status at the ith position considering the genome 
sequence from the first bp Si to S^. 

M-. The ith base pair in the DNA sequence of an H3K4me3 
nucleosome. 

N-. The ith base pair the DNA sequence of an H3 nucleosome. 

at{Mi): The probability that the sequence from 5*1 to St is observed, 
and St is the ith bp of the DNA sequence of an H3K4me3 nucleosome. 

oitiNi)'. The probability that the sequence from Si to St is observed, 
and St is the ith bp of the DNA sequence of an H3 nucleosome. 

at{d)\ The probability that the sequence from Si to St is observed, and 
St is not bound by any H3 or H3K4me3 nucleosome. 

(1) Initialization 

ai{d) = 7TdPd{Si) 

d: Depletion of all nucleosomes 
Ttd'. Proportion of nucleo some-depleted regions in the genome 
7Tn'. Proportion of nucleosome-bound regions in the genome 
tvm'- Proportion of H3K4me3 nucleosome-bound regions in the 
genome 

(2) Induction 

arid) = 71 d Pd{St)[at-i{d) + at-i{NiAi) + a,_i(Mi47)] 
at{Mi) = 7iM PM{St)[at-i{d) + at-i{Ni4i) + at-i^M^^)] 
at{M2) = at-i{Mi)PM{St\St-i) 



at{M^) = at-i(M2)PM(St I St-2, St-i) 

at{M,) = at-i(M,)PMiSt \ Sts, St-2, St-i) 

at(Ms) = at-i(M4)PM(St I ^.-4, ^/-3, ^.-2, St-i) 

at(Mi) = at-i(Mi_i)PM(Sr \ Sts, ^4, ^/-3, ^.-2, St-i), 6<i< 147 

at(Ni) = TIN PN{St)[0it-l{d) + Qf,_i(A^i47) + Of,_i(Mi47)] 

at{N2) = at-i{Ni)PN{St\St-i) 

at{N^) = at-iiN2)PNiSt I ^.-2, St-i) 

at{N,) = at-iiN3)PNiSt I ^.-3, ^.-2, ^-.-i) 

at(Ns) = at-i(N4)PN(St I ^.-4, ^r-3, St-2, St-i) 

atiNi) = at-iiNi_i)PNiSt I ^.-5, ^.-4, ^.-3, ^r-2, St-iX 6<i< 147 

(3) Termination 

P{Si, ... ,St\ Model) = arid) + o(t{Mj) + J2 o(t{Nj), where T is 
the total length of the genome. 

2.3.2 Backward procedure Using the backward procedure, we calcu- 
late the probability of observing the sequence, from St+i to the end of the 
genome with an H3K4me3 nucleosome at Sj = M,. 

l3t(Mi): Probability of the sequence from to the end of the genome 
when St = Mj. 

^t{Ni): Probability of the sequence from St+i to the end of the genome 
when St = Nj. 

^t{d): Probability of the sequence from St+\ to the end of the genome 
when St = d. 

(1) Initialization 

^j(d) = ^T(Mi) = ^T(Ni) =1, 1 < / < 147 

(2) Induction 

fitid) = 7Td Pd(St+lWt+lid) + 7TNPN(St+l)Pt+liNi) + 7TMPM(St+l)Pt+l(Mi) 

^t{Mt) = Pt+i(Mi+i)PM(St+i I St+2, St+3, St+4, St+5, St+el 1 < ^' < 141 

Pt(Mi42) = fit+i(Mi43)PM(St+l I St+2, St+3, St+4, St+5) 
^t{Mi43) = ^t+x{Mi44)PM{.St+l I St+2, St+3, St+4) 
^t{Mi44) = fit+l(Mi45)PM(St+l I St+2, St+3) 
^t{Mi45) = fit+liMi46)PM(St+l I St+2) 

PtiMue) = ^t+\{Mi4i)PM{St+i) 

^t{Mi4l) = 7Td Pd(St+l)l3t+l(d) + 7TNPN(St+l)^t+l(Nl) + JTMPM(St+l)^t+l(Mi) 

Pt(Ni) = fit+iiNi+i)PN(St+i I St+2, St+3, St+4, St+5, St+e), l<i< 141 

^t(Ni42) = ^^t+\(Ni43)PN(St+l I St+2, St+3, St+4, St+5) 
^t{Ni43) = ^t+\{Ni44)PN{,St+l I St+2, St+3, St+4) 
fitiNm) = fit+liNi45)PN(St+l I St+2, St+3) 
fitiNi45) = fit+liNi46)PN(St+l I St+2) 
^t(Ni46) = ^t+l(Ni47)PN(St+l) 

Pt(Ni47) = 7Td Pd(St+l)fit+lid) + 7TNPN(St+l)Pt+liNi) + 7TMPM(St+l)Pt+l(Mi) 

2.3.3 Integration of backward and forward probabilities The 
probability that the tth bp is at Mj is normalized by the sum of all con- 
figurations in our model. 

P(St = Mi\Su ... , St, Model) 

^ atjMtmMt) 

147 

at(d)Pt(d) + E [at(Nj)mj) + at(Mj)Pt(Mj)] 

.7=1 

2.3.4 The H3K4me3 sequence signature ( the probability of 
H3K4me3 occupation based on DNA primary sequence) The 
probability that a base pair position can potentially be covered by an 
H3K4me3 can be calculated by summing the probabilities from 
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Ml to Ml 47. 



H3 



H3K4me3 



P(St is covered by H3K4me3) 

147 

J2P{St = Mj\Su St, Model) 



7=1 



.7=1 



.;=i 

147 

P{St is covered by H3) = ^(^^ = A/,- 1 ^1 , . . . , ^r, Model) 

./=i 

147 

^ 7=1 

~ 147 

a,(d)pr(d) + E W)A(A^;) + a,(MjmMj)] 

.7=1 

2.4 //I vivo coverage of modifled nucleosomes at a 
base pair 

The ChlP-seq reads obtained from published data were mapped to the 
human genome sequence (hgl9). Each 30-50 bp read perfectly matching 
the genome sequence only once was extended to 151 bp from the 5'-end of 
the read because the immunoprecipitated DNA fragments are 150 bp in 
length. The coverage of each histone modification at a base pair position 
was estimated by the number of extended reads covering that position. As 
the coverage is dependent on the depth of ChlP-seq, we use correlation 
coefficients that are standardized by standard deviations. The compari- 
sons were made in the mappable genomic regions. 

2.5 Comparison of the probabilistic H3K4me3 
occupation map with in vivo data 

To validate the predicted H3K4me3 occupation map, we calculate the 
correlation coefficient between the probability of H3K4me3 occupation 
at each base pair in the human reference genome (hgl9/GRC37) and the 
in vivo coverage of H3K4me3 using ChlP-seq reads. 

2.6 Identification of TF-binding sites and inter-chromatin 
interaction sites 

The ER-a-bound human chromatin interaction sites were obtained from 
the ChlA-PET data in hgl8 (Fullwood et al., 2009) and transferred to 
hgl9 using liftOver. The AR- and FoxAl -binding sites in prostate cancer 
cells identified by Wang et al. (2009a) and Lupien et al. (2008) were 
transferred to hgl9 using liftOver. NANOG-, OCT-, SOX2-, KLF4- 
and TAFl -binding sites in embryonic stem cells identified by Lister 
et al. (2009) were moved to hgl9 using liftOver. 
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Fig. 1. Dinucleotide frequencies in H3 and H3K4me3 nucleosomes. The 
frequency of every dinucleotide in H3 and H3K4me3 nucleosome-bound 
genomic sequences was calculated at each position using 3 bp windows. 
The horizontal axis represents distance from the center of nucleosome- 
bound sequences 



H3 versus H3K4me3, CA/TG in H3 versus H3K4me3, AT in H3 
versus H3K4me3, GA/TC in H3 versus H3K4me3, TA in H3 
versus H3K4me3, AA/TT in H3 versus H3K4me3). 

The different dimer sequence frequencies between H3 and 
H3K4me3 nucleosomes imply that an n-mer (n>2) sequence 
composition can be a distinct feature of H3K4me3 nucleosomes. 
We studied 6mer sequences in histone methylations because 6mer 
sequences are long enough to represent sequence preferences, 
and the computation is still feasible. The distinct sequence pref- 
erences of H3K4me3 and H3 nucleosome are well reflected in 
6mer sequence specificities and are consistently found in various 
cell types (Fig. 2a and b, r > 0.98, P = 0). In contrast, H3K4mel 
and H3K27me3 nucleosomes show no consistent sequence pref- 
erences among cell types (Fig. 2d and e). The preferred 6mer 
DNA sequences of H3K4me3 show only a weak correlation 
with H3 nucleosome sequence preferences (Fig. 2c, r = 0.24, 
P = 0), indicating that H3K4me3 modification uses a mechanism 
different from that of any other nucleosome. 

To examine whether the sequence specificities differ between 
promoters and non-promoter regions, we calculated sequence 
specificities of histone modifications in the promoters and non- 
promoter regions. We found that the sequence specificities of 
histone modifications are highly significantly correlated between 
promoters and non-promoter regions (Supplementary Fig. S5), 
implying that the 6mer sequence specificities of H3K4me3 are 
consistently maintained across the genome. 



3 RESULTS 

3.1 Sequence specificities of nucleosome modifications 

We compared position- specific dinucleotide frequencies between 
H3 and H3K4me3 nucleosomal DNA sequences. H3K4me3- 
bound DNA sequences show a 10 bp periodicity of dinucleotide 
frequencies, but different amphtudes from H3 nucleosome- 
bound DNA (Supplementary Figs SI and S2) (Satchwell et ai, 
1986; Segal et al., 2006). In addition to the 10 bp periodicity, 
position-specific frequencies of each dinucleotide are within spe- 
cific discrete ranges and differ from those of H3 nucleosomes 
(Fig. 1; paired ^tests, P- values ~ 0 for CG frequencies in H3 
versus H3K4me3, GC in H3 versus H3K4me3, CC/GG in H3 
versus H3K4me3, AG/CT in H3 versus H3K4me3, AC/GT in 



3.2 A probabilistic model for an H3K4me3 occupation 
map in the human primary genome sequence 

Using the aforementioned sequence specificities of H3K4me3 
and H3 nucleosomes, we can compute the probability, 
P(H3K4me3 | S), that a given 147 bp or shorter sequence seg- 
ment (S) is an H3K4me3 nucleosome site, and also P(H3 | S) 
(Fig. 3). In principle, a base pair in the genome sequence can 
be the ith bp of the DNA sequence of a nucleosome 
(/= 1, . . . , 147). However, at each time at most only one nucleo- 
some can occur at a base position. To consider various possible 
arrangements of nucleosomes occupying a site, we constructed a 
modified 5th order (HMM of H3K4me3 and H3 nucleosome 
occupation (Fig. 3). Our model calculates the probability of 
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Fig. 2. H3K4me3 shows consistent and distinct sequence specificity 
among various human cell types, whereas other modified nucleosomes 
do not. (a-e) The correlations of the 6mer specificities in H3K4me3 nu- 
cleosomes (a), in H3 nucleosomes (b), between H3K4me3 and H3 nu- 
cleosomes (c), in H3K4mel nucleosomes (d) and in H3K27me3 
nucleosomes (e). The nucleosome-bound DNAs were identified using 
the ChlP-seq data from various cell types: CD4 + T cells in resting 
state (CD4 + Rest), CD4 + T cells in activated state (CD4 + Act), 
CD36 + erythrocyte stem cell (CD36 + ), CD133 + , HESC and HeLa 
cells. The values in the squares represent correlation coefficients of 
6mer sequence preference between two cell types. The red stars in a 
square signify that the correlation is highly significant, with the P-value 
close to 0. The vertical and horizontal axes represent the number of ChlP- 
seq reads covering the 6mer normalized by the number of 6mers in the 
human genome. Each point represents the enrichment of a 6mer in the 
ChlP-seq experiments in two cell types. The cell types on the x-axis and 
y-axis are marked on the main diagonal 



every possible arrangement of H3K4me3 and H3 nucleosomes 
on the whole genome. It distinguishes between H3K4me3 and 
H3 nucleosomes and integrates all possible arrangements based 
on forward and backward sequences of every base pair (Fig. 3). 
Ultimately, our model calculates the sum of the probabihties of 
all possible H3K4me3 and H3 nucleosomes that can potentially 
cover the base pair (see Section 2). If the loci associated with a 
high probability of H3K4me3 nucleosome occupation becomes 
significantly long, the genomic region may be considered to be 
fuzzy nucleosome sites as many previous research articles did, 
including Yuan et al (2005) and Segal et al (2006) 
(Supplementary Fig. S6). 

3.3 Association of the genome sequence context and the 
H3K4me3 occupation 

To examine the association of primary genome sequence context 
and H3K4me3 distribution, we compared our sequence-based 
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Fig. 3. Flow chart to compute the probabilities of H3K4me3 and H3 
nucleosome occupation at each base pair in the human genome, (a) First, 
the sequence specificities of H3K4me3 and H3 nucleosomes on 6mer 
sequences are inferred using in vivo ChlP-seq data. Second, the sequence 
specificities are used to compute the probability that a given sequence S of 
147 bp is a potential H3K4me3 nucleosome site and the corresponding 
probability for an H3 nucleosome site without H3K4me3. Third, a mod- 
ified fifth order HMM is constructed to compute the probabilities that a 
given base pair in the human genome is covered by an H3K4me3 nucleo- 
some or an H3 nucleosome. (b) The probability of occupation at a base 
pair is the sum of the occupation probabilities of all the nucleosomes that 
can occupy the base pair 



prediction of H3K4me3 occupation of a locus with the occupa- 
tion level from ChlP-seq experiments. As ChlP-seq experiments 
cannot provide 1 bp resolution localization of nucleosomes, we 
use the probabilistic occupation level of H3K4me3 nucleosome 
and the number of ChlP-seq reads covering the locus. We found 
several lines of evidence for a significant correlation between our 
sequence-based prediction of H3K4me3 occupation and the ex- 
perimental data. First, the 5'-ends of the E2F2 and GAPDH 
genes have been experimentally shown to carry a high level of 
H3K4me3 (Steward et al, 2006), and our model indeed predicts 
that these genomic regions have a high probabihty of H3K4me3 
occupation (E2F2, r = 0.73, ^ = 3x10"^ bp; GAPDH, r = 0.76, 
n = 6x 10^ bp) (Fig. 4a and b). The correlation coefficients of 
coverage at a base pair resolution are within the ranges observed 
in biological rephcations (Leleu et al, 2010). 

Second, we compare the probability of a base pair being cov- 
ered by H3K4me3 and the number of reads covering the base 
pair in ChlP-seq experiments in various cell types. The probabil- 
ity correlates well with the in vivo occupancy of H3K4me3 at 
each base pair in the 2.5 x 10^ bp mappable genomic region, 
estimated from ChlP-seq data in diverse cell types, including 
embryonic stem cells (HESC, r = 0.83, F = 0, n = 2.5x 10%p), 
fetal fibroblast cells (IMR90, r = 0.63, P = 0, « = 2.5x 10%p), 
CD 1 33 + hematopoietic stem cells (r = 0.55, P = 0, 
n = 2.5 X \{f bp), CD36 + erythrocyte precursors (r = 0. 54, 
P = 0, n = 2.5x 10%p), HeLa (r = 0.52, F = 0, n = 2.5x 10%p) 
and resting and activated CD4 + T helper cells (r = 0.46, P = 0, 
n = 2x 10^ bp) (Fig. 4c). The enrichment of ChlP-seq reads pro- 
vides probabihstic occupation level as well. In addition, there is 
possible variation in ChlP-seq experiments and differential regu- 
lation of H3K4me3 among cell types (Supplementary Fig. S3). 
The correlations between our predictions and the in vivo data are 
surprisingly high, suggesting a high accuracy of our model. 
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Fig. 4. Computed probabilities of H3K4me3 occupation recapitulate 
in vivo occupancies of H3K4me3 across various cell types, (a and b) 
The 5' -end of E2F2 and GAPDH genes have been experimentally vali- 
dated to have a very high level of H3K4me3 using western blot (Steward 
et al., 2006). Arrow indicates the direction of transcription. The level of 
black bar is the probability of H3K4me3 occupation at a base pair. 
A gray box represents the genomic region containing previously validated 
high-density H3K4me3 nucleosomes. (c) H3K4me3 ChlP-seq experimen- 
tal occupancy is correlated with probabiHty of H3K4me3 nucleosome 
occupation based on the human genome among cell types. The gray 
lines represent the 99% confidence interval of the mean ChlP-seq read 
occupancy, (d) Precision-recall analyses of P(H3K4me3 occupation). The 
dots mark cut-off probability of H3K4me3 occupation at 0.1-0.9 
increased by 0.1 



Third, we conducted precision/recall analyses to assess the ef- 
ficiency of our method. The precision level is the proportion of 
genomic regions occupied by H3K4me3 from the ChlP-seq ex- 
periment among the predicted genomic regions at an occupation 
probabihty cut-off. The recall level is the proportion of genomic 
regions predicted to be occupied by an H3K4me3 nucleosome 
among the genomic regions occupied by H3K4me3 from the 
ChlP-seq experiment. The experimentally verified positive loci 
associated with enrichment of H3K4me3 occupation were ex- 
tracted from the H3K4me3 ChlP-seq experiment in CD4 + T 
cells (Poisson process, P< 10~^). With the P(H3K4me3 occupa- 
tion) 0.5 cut-off, the precision level was 76% and the recall level 
was 37% in CD4 + T cells (Fig. 4d). As only 5% of the human 
genome is actually occupied by an H3K4me3 nucleosome in 
CD4 + T cells, our prediction of H3K4me3 occupation efficiently 
enriches the true H3K4me3 occupation sites [x^ test, H3K4me3 
occupation sites in the genome (5%) versus H3K4me3 occupa- 
tion sites among P(H3K4me3 occupation) > 0.5 (76%), P = 0]. 
In view of the fact that only 1 % of the genome is predicted with 
P(H3K4me3 occupation) > 0.5, the enrichment of 37% true posi- 
tive in P(H3K4me3 occupation) is highly significant [x^ test, 
P(H3K4me3 occupation) >0.5 in total genome (1%) versus 
P(H3K4me3 occupation) >0.5 among true H3K4me3 occupa- 
tion sites (37%), P = 0]. The precision-recall analyses suggest 
that our method efficiently predicts H3K4me3 nucleosome 
occupations. 
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Fig. 5. The probability of H3K4me3 occupation on regulatory elements 
in the human genome, (a) The probabilities of H3K4me3 occupation on 
regulatory elements in the human genome. The red dotted line represents 
the average probability of H3K4me3 occupation in the whole genome. 
The vertical axes are in the logarithmic scale, (b) The false negative rates 
at regulatory elements represent the occupation of H3K4me3 not follow- 
ing the high probability of H3K4me3 occupation. Out of the sites with 
the probability of H3K4me3 occupation <10~^, the proportion of sites 
with the top 0.1% ChlP-seq reads in CD4 + T cells were calculated in a 
200 bp window for each regulatory element. Asterisks represent statistical 
significance, P< 10~^ 



3.4 The probability of H3K4me3 occupation at regulator- 
chromatin interaction sites 

To investigate the role of the predicted H3K4me3 occupation in 
transcription, we studied its predicted occurrence around a tran- 
scribed region. We found a high peak of the probabihty of 
H3K4me3 occupation at the TSS and TTS of genes and at the 
start and end sites of exons (Fig. 5a). To study the probabihstic 
H3K4me3 occupation profile in higher- order chromatin-chro- 
matin interaction, we examined the relationship between the 
probability of H3K4me3 occupation and protein factors regulat- 
ing chromatin structure. Histone acetylases help long-range chro- 
matin interactions between enhancer-bound TFs and the RNA 
Pol II complex at promoters and enhancers. Insulators are pref- 
erentially bound by CTCF and involved in gene regulation and 
chromatin structural changes. We calculated the read coverage 
from ChlP-seq of RNA Pol II to estimate the distribution of 
RNA Pol II on the human genome. The probabihty of 
H3K4me3 occupation is significantly correlated with binding 
of RNA Pol II (Fig. 5a). Using the ChlP-seq data of various 
histone acetylases, deacetylases and CTCF, we estimated their 
binding preferences by the number of reads covering a base pair. 
The histone acetylases included in the analyses are p300, CBP, 
MOF, PCAF and Tip60, and the histone deacetylases included 
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are HDACl, HDAC2, HDAC3 and HDAC6. We found that 
histone acetylases, deacetylases and CTCF preferentially interact 
with the genome regions showing high probability of H3K4me3 
(Fig. 5a). Likewise, the in vivo levels of histone acetylations sig- 
nificantly correlate with the probability of H3K4me3 occupation. 
The significant co-localizations of various histone acetylations on 
the predicted H3K4me3 sites suggest that the high probability of 
H3K4me3 occupation represents the sequence context of histone 
modifications at regulatory elements. Note that p300, a well- 
known enhancer-binding protein (Visel et al., 2009), preferen- 
tially binds the genomic region associated with high probability 
of H3K4me3 occupation. These observations suggest that the 
sequence context encoding high probability of H3K4me3 occu- 
pation is specifically distributed across the human genome. In 
particular, the probability of H3K4me3 occupation is correlated 
with protein factors that affect chromatin structure. 

In general, we found many TF-binding sites associated with 
the high probability of H3K4me3 occupation (Supplementary 
Tables SI and S2). The OCT4-, NANOG-, SOX2- and KLF4- 
binding sites identified by ChlP-seq in HESC (Lister et al, 2009) 
show high probability of the H3K4me3 occupation. These TFs 
are key regulators of stem cell identity and closely interact with 
histone modifications (Boyer et al, 2005). The results suggest 
that cell-type-specific TFs interact with different patterns of 
the probability of H3K4me3 occupation. 

Finally, we estimated the proportion of H3K4me3 nucleosome 
occupied sites not following the probability of H3K4me3 occu- 
pation. We calculated the false negative rates of the probabiHstic 
occupation map of H3K4me3 at regulatory elements (Fig. 5b). 
The false negative rate is the proportion of sites showing a high 
occupancy level of H3K4me3 (a high number of H3K4me3 
ChlP-seq reads, Poisson distribution, P<10~^) despite a very 
low probability of the H3K4me3 occupation [P(H3K4me3) 
<0.00001]. The TSS regions and the first exons show signifi- 
cantly higher false negative rates (12 and 15%) than most 
other regulatory elements. This suggests that the sequence signa- 
ture is not only derived from TSS regions and the first exons but 
also from other genomic regions (Supplementary Fig. S3). 
Although Pol II-, p300- and CTCF-binding sites contain a 
high probability of the H3K4me3 occupation, they still show 
high false negative rates. This suggests that some H3K4me3 oc- 
cupation in these regions is determined not only by sequence 
context but also by epigenetic factors not explained by DNA 
sequence context. 

4 DISCUSSION 

The different sequence preferences of H3K4me3 and H3 nucleo- 
somes imply distinct mechanisms in their arrangement on the 
genome. Although the H3 nucleosome is mainly distributed by 
the bending property of DNA sequences (Segal et al, 2006; 
Vafabakhsh and Ha, 2012), the H3K4me3 nucleosome is regu- 
lated by protein factors recognizing specific sequence compos- 
itions, such as non-methylated CG-rich sequence signatures. The 
sequence-specific protein factors interact with histone methylase 
complexes specific to H3K4me3, such as Setdl. In mouse, inser- 
tion of non-methylated CpG-rich sequences leads to a high dens- 
ity of H3K4me3 occupation. For example, Cfpl specifically 
binds to non-methylated CpG rich sequences and recruits the 



Setdl complex (Thomson et al, 2010). In mouse and human, 
PRDM9, a meiosis-specific H3K4me3 histone methyltransferase, 
recognizes the target sites by the sequence specificity and deposits 
H3K4me3 nucleosomes at meiosis recombination hotspots 
(Baudat et al, 2010). Therefore, the sequence context showing 
high probability of H3K4me3 occupation at important regula- 
tory elements across the genome may attract chromatin modify- 
ing factors. 

We found consistent and distinct sequence specificity only for 
H3K4me3 but not for other histone modifications. It is possible 
that other histone modifications are associated with some se- 
quence specificities that cannot be detected by the 6mer sequence 
preferences. Another possibility is that sequence-independent 
epigenetic factors regulate other histone modifications. Indeed, 
recently Tsai et al. (2010) showed that long non-coding RNAs 
guide the H3K27me3 enrichment, although the sequence specifi- 
city of the non-coding RNAs has not yet been identified. 

Previously, Yuan et al. (2005) also used HMM to estimate 
nucleosome occupation probabilities from microarray data of 
nuclease- treated DNA. Their hidden state was the occupation 
of nucleosome, and the observations were microarray Cy3 inten- 
sities. Their emission probabilities were Cy3 intensities compared 
with the input DNA signal, Cy5. They characterized nucleosome 
location based on the consecutive eight probe intensities. On the 
other hand, our purpose is to compute the occupation probabil- 
ities of different types of nucleosomes from the genomic sequence 
alone. We first showed that the distinctive sequence specificities 
of H3K4me3 and H3 nucleosomes are consistent among differ- 
ent cell types. We then used the observed sequence specificities of 
H3 and H3K4me3 to predict their occupation probabilities at 
each mappable base pair from the genomic sequence alone. In 
our HMM model, the hidden state is the occupation of 
H3K4me3 (or H3) nucleosome at a base pair, and the observed 
state is the genome sequence. The emission probabilities in our 
model are sequence specificities of H4K4me3 (or H3). The pre- 
dicted probabilities of H3K4me3 occupation are consistent with 
experimental data from various human cell types. Although 
ChlP-seq data contain noise, the probability of H3K4me3 oc- 
cupying a genome region can be estimated from the sequence 
reads, as long as the number of reads is large enough for the 
statistics to be meaningful. 

Our model for predicting an H3K4me3 nucleosome map from 
the human genome sequence is similar to Segal et al.'s model 
(2006) in that we assume that nucleosome formation in a 
genome is a dynamic process, and we calculate the nucleosome 
occupation probability at a locus considering all possible nucleo- 
somes that can potentially cover the site and their forward and 
backward configurations in a genome. However, our model dif- 
fers from theirs in at least three aspects. First, our model incorp- 
orates the specific modification status of each nucleosome and 
distinguishes between H3 and H3K4me3 occupations. On the 
other hand, Segal et al. and other studies did not distinguish 
between H3 and H3K4me3 nucleosomes, but considered general 
nucleosomes, i.e. micrococal nuclease- sensitive regions. Second, 
we compute the sequence compositions of DNA segments in H3 
nucleosomes and in H3K4me3 nucleosomes separately, whereas 
Segal et al. computed the composition in extracted DNA seg- 
ments in all forms of nucleosomes after treatment of micrococcal 
nuclease. Third, we aim to infer the probabilistic occupation level 
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of the H3K4me3 nucleosome based on the genome sequence 
context, whereas Segal et al. computed the probabihstic position- 
ing of a base pair within 147 bp nucleosomal DNAs. Our model 
is based on the ChlP-seq experiments. ChlP-seq enables us to 
infer the probable coverage of a locus but not exact positions 
within nucleosomes. Therefore, our model can only estimate the 
probabilistic occupation levels of H3K4me3 and H3 nucleo- 
somes based on the whole-genome sequence context. 

In conclusion, our results underscore a relationship between 
genome sequence and chromatin remodeling. In addition, se- 
quence-independent epigenetic regulation of nucleosomes also 
contributes to chromatin modifications. Our analyses of the 
H3K4me3 sequence signature will be a valuable starting point 
for future studies of the genome structure. Ultimately, the com- 
putational approach may help identify epigenetic and genetic 
factors that regulate the chromatin structure. 
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